PROCESSOR AND METHOD OF EXECUTING LOAD INSTRUCTIONS OUT-OF-ORDER HAVING REDUCED HAZARD PENALTY

BACKGROUND OF THE INVENTION

1. Technical Field: 

The present invention relates in general to data processing and, in particular, to the execution of load instructions by a processor. Still more particularly, the present invention relates to a processor that buffers load data for out-of-order load instructions in order to reduce the performance penalty associated with data hazards.

2. Description of the Related Art: 

A typical superscalar processor can comprise, for example, an instruction cache for storing instructions, one or more execution units for executing sequential instructions, a branch unit for executing branch instructions, instruction sequencing logic for routing instructions to the various execution units, and registers for storing operands and result data. In order to leverage the parallel execution capabilities of these multiple execution units, some superscalar processors support out-of-order execution, that is, the execution of instructions in a different order than the programmed sequence.

When executing instructions out-of-order, it is essential for correctness that the processor produce the same execution results that would have been produced had the instructions been executed in the programmed sequence. For example, given the following sequence of instructions:

    LOAD1
    ADD
    STORE
    LOAD2

where LOAD1 and LOAD2 target the same address and LOAD1 precedes LOAD2 in program order, LOAD2 cannot be permitted to receive older data than LOAD1. However, if LOAD2 is executed prior to (i.e., out-of-order with respect to) LOAD1, LOAD2 may receive older data than LOAD1 if the intervening STORE is targeted at the same address or if another processor within the same computer system stores to the same address. A scenario in which an out-of-order executed load instruction receives incorrect data is defined herein to be a data hazard.
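The definition can be made concrete with a small behavioral model. This is a sketch under stated assumptions, not a description of any hardware: memory values are modeled as version numbers that only increase with each store, the remote store is injected at a chosen execution step, and the names `execute`, `has_data_hazard` and `ADDRESS` are hypothetical.

```python
ADDRESS = 0x100                     # hypothetical address targeted by both loads
PROGRAM_ORDER = ("LOAD1", "LOAD2")  # LOAD1 precedes LOAD2 in program order


def execute(execution_order, remote_store_before_step):
    """Execute the two loads in execution_order. Just before the given step,
    another processor stores to ADDRESS, bumping its version number."""
    memory = {ADDRESS: 0}  # values modeled as monotonically increasing versions
    observed = {}
    for step, load in enumerate(execution_order):
        if step == remote_store_before_step:
            memory[ADDRESS] += 1
        observed[load] = memory[ADDRESS]
    return observed


def has_data_hazard(observed):
    """Data hazard: the later-in-program-order load saw older data."""
    earlier, later = PROGRAM_ORDER
    return observed[later] < observed[earlier]


in_order = execute(["LOAD1", "LOAD2"], remote_store_before_step=1)
out_of_order = execute(["LOAD2", "LOAD1"], remote_store_before_step=1)
print(in_order, has_data_hazard(in_order))          # {'LOAD1': 0, 'LOAD2': 1} False
print(out_of_order, has_data_hazard(out_of_order))  # {'LOAD2': 0, 'LOAD1': 1} True
```

With the same remote store timing, only the out-of-order execution lets LOAD2 observe an older version than LOAD1.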


Superscalar processors that support out-of-order execution of load instructions typically detect and correct for data hazards by implementing a load queue that stores the target address of each load instruction that was executed out-of-order. Following execution of the out-of-order load instruction, addresses of exclusive transactions (e.g., read-with-intent-to-modify or kill) driven on the computer system interconnect by other processors, as well as store instructions preceding the load instruction that are initiated by the processor itself, are snooped against the entries within the load queue. If a snooped exclusive transaction or a local store operation hits within the load queue, the entry is marked, for example, by setting a flag.






Thereafter, when the processor executes a load instruction, the processor determines whether or not the load instruction precedes the out-of-order load instruction in program order and whether or not the subsequently executed load instruction targets an address specified in a marked entry in the load queue. If so, a data hazard is detected, and the processor flushes and re-executes at least both load instructions, and possibly all instructions in flight following the first of the two load instructions in program order. Flushing and re-executing instructions in this manner to remedy data hazards results in a significant performance penalty, particularly for processors having wide instruction execution windows.
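The related-art scheme can be modeled behaviorally in a few lines. This is a sketch under stated assumptions: program order is represented by hypothetical increasing sequence numbers, only loads executed out-of-order are recorded, and the class and method names (`ConventionalLoadQueue`, `snoop`, `must_flush`) are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class LoadQueueEntry:
    seq: int              # program-order sequence number of an out-of-order load
    address: int          # target address of that load
    marked: bool = False  # set when a snooped exclusive op or local store hits


@dataclass
class ConventionalLoadQueue:
    entries: list = field(default_factory=list)

    def record_out_of_order_load(self, seq: int, address: int) -> None:
        self.entries.append(LoadQueueEntry(seq, address))

    def snoop(self, address: int) -> None:
        """Mark entries hit by a remote exclusive transaction or local store."""
        for entry in self.entries:
            if entry.address == address:
                entry.marked = True

    def must_flush(self, seq: int, address: int) -> bool:
        """True if this load precedes a marked out-of-order load in program
        order and targets the same address; the remedy here is a flush."""
        return any(
            entry.seq > seq and entry.address == address and entry.marked
            for entry in self.entries
        )


queue = ConventionalLoadQueue()
queue.record_out_of_order_load(seq=2, address=0x100)  # LOAD2 executed first
print(queue.must_flush(seq=1, address=0x100))  # False: nothing snooped yet
queue.snoop(0x100)                              # remote RWITM or kill hits
print(queue.must_flush(seq=1, address=0x100))  # True: flush and re-execute
```

Note that the queue records only addresses, so the only available remedy is to flush and re-execute.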
SUMMARY OF THE INVENTION

The present invention reduces the performance penalty associated with data hazards resulting from the out-of-order execution of load instructions by implementing an improved load queue within a processor.

In accordance with the present invention, a processor having a reduced data hazard penalty includes a register set, at least one execution unit that executes load instructions to transfer data into the register set, and a load queue. The load queue contains at least one entry, and each occupied entry in the load queue stores load data retrieved by an executed load instruction in association with a target address of the executed load instruction. The load queue has associated queue management logic that, in response to execution by the execution unit of a load instruction, determines by reference to the load queue whether a data hazard exists for the load instruction. If so, the queue management logic outputs load data from the load queue to the register set in accordance with the load instruction, thus eliminating the need to flush and re-execute the load instruction.


All objects, features, and advantages of the 
present invention will become apparent in the following 
detailed written description. 
BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

Figure 1 depicts an illustrative embodiment of 
a data processing system with which the method and system 
of the present invention may advantageously be utilized; 

Figure 2 is a block diagram of an exemplary embodiment of the load data queue (LDQ) illustrated in Figure 1;

Figure 3A is a high level logical flowchart of 
an exemplary method by which the queue management logic 
shown in Figure 2 updates the LDQ in response to various 
stages in the processing of local load operations; 

Figure 3B is a high level logical flowchart of an exemplary method by which the queue management logic of Figure 2 manages the LDQ in response to detection of local store operations and remote exclusive operations; and

Figures 4A-4C are three views of LDQ 114 that together illustrate an exemplary operating scenario in which a data hazard caused by a remote exclusive operation is detected and corrected in accordance with the present invention.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT

With reference now to the figures and in particular with reference to Figure 1, there is depicted a high level block diagram of an illustrative embodiment of a processor, indicated generally at 10, for processing instructions and data in accordance with the present invention. In particular, processor 10 reduces data hazard penalties by implementing a load data queue that buffers load data associated with out-of-order load instructions.

Processor 10 comprises a single integrated circuit superscalar processor, which, as discussed further below, includes various execution units, registers, buffers, memories, and other functional units that are all formed by integrated circuitry. As illustrated in Figure 1, processor 10 may be coupled to other devices, such as a system memory 12 and a second processor 10, by an interconnect fabric 14 to form a larger data processing system such as a workstation computer system. Processor 10 also includes an on-chip multi-level cache hierarchy including a unified level two (L2) cache 16 and bifurcated level one (L1) instruction (I) and data (D) caches 18 and 20, respectively. As is well-known to those skilled in the art, caches 16, 18 and 20 provide low latency access to cache lines corresponding to memory locations in system memory 12.

Instructions are fetched for processing from L1 I-cache 18 in response to the effective address (EA) residing in instruction fetch address register (IFAR) 30. During each cycle, a new instruction fetch address may be loaded into IFAR 30 from one of three sources: branch prediction unit (BPU) 36, which provides speculative target path addresses resulting from the prediction of conditional branch instructions, global completion table (GCT) 38, which provides sequential path addresses, and branch execution unit (BEU) 92, which provides non-speculative addresses resulting from the resolution of predicted conditional branch instructions. If hit/miss logic 22 determines, after translation of the EA contained in IFAR 30 by effective-to-real address translation (ERAT) 32 and lookup of the real address (RA) in I-cache directory 34, that the cache line of instructions corresponding to the EA in IFAR 30 does not reside in L1 I-cache 18, then hit/miss logic 22 provides the RA to L2 cache 16 as a request address via I-cache request bus 24. Such request addresses may also be generated by prefetch logic within L2 cache 16 based upon recent access patterns. In response to a request address, L2 cache 16 outputs a cache line of instructions, which are loaded into prefetch buffer (PB) 28 and L1 I-cache 18 via I-cache reload bus 26, possibly after passing through optional predecode logic 144.
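The hit/miss decision can be sketched as follows. This is a simplified model under assumptions not stated in the text: a 128-byte cache line, an ERAT represented as a dictionary from line-aligned EA to line-aligned RA (an ERAT miss simply raises `KeyError`), an I-cache directory represented as a set of RA line addresses, and a list standing in for I-cache request bus 24. The function name `fetch_line` is hypothetical; prefetching and predecode are not modeled.

```python
LINE_SIZE = 128  # assumed cache-line size in bytes (not specified in the text)


def fetch_line(ea, erat, icache_directory, l2_request_bus):
    """Return the RA of the line if it hits in the L1 I-cache; otherwise
    post the RA to the L2 request queue and return None."""
    ea_line = ea - (ea % LINE_SIZE)   # line-aligned effective address
    ra_line = erat[ea_line]           # effective-to-real translation (ERAT 32)
    if ra_line in icache_directory:   # lookup in I-cache directory 34
        return ra_line
    l2_request_bus.append(ra_line)    # miss: request address to L2 cache 16
    return None


erat = {0x1000: 0x8000}  # hypothetical EA line -> RA line mapping
directory = set()        # I-cache directory starts empty
requests = []
print(fetch_line(0x1004, erat, directory, requests))        # None (miss)
print([hex(ra) for ra in requests])                         # ['0x8000']
directory.add(0x8000)    # line reloaded via I-cache reload bus 26
print(hex(fetch_line(0x1004, erat, directory, requests)))   # 0x8000 (hit)
```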



Once the cache line specified by the EA in IFAR 30 resides in L1 I-cache 18, L1 I-cache 18 outputs the cache line to both branch prediction unit (BPU) 36 and to instruction fetch buffer (IFB) 40. BPU 36 scans the cache line of instructions for branch instructions and predicts the outcome of conditional branch instructions, if any. Following a branch prediction, BPU 36 furnishes a speculative instruction fetch address to IFAR 30, as discussed above, and passes the prediction to branch instruction queue 64 so that the accuracy of the prediction can be determined when the conditional branch instruction is subsequently resolved by branch execution unit 92.



IFB 40 temporarily buffers the cache line of instructions received from L1 I-cache 18 until the cache line of instructions can be translated by instruction translation unit (ITU) 42. In the illustrated embodiment of processor 10, ITU 42 translates instructions from user instruction set architecture (UISA) instructions into a possibly different number of internal ISA (IISA) instructions that are directly executable by the execution units of processor 10. Such translation may be performed, for example, by reference to microcode stored in a read-only memory (ROM) template. In at least some embodiments, the UISA-to-IISA translation results in a different number of IISA instructions than UISA instructions and/or IISA instructions of different lengths than corresponding UISA instructions. The resultant IISA instructions are then assigned by global completion table 38 to an instruction group, the members of which are permitted to be executed out-of-order with respect to one another. Global completion table 38 tracks each instruction group for which execution has yet to be completed by at least one associated EA, which is preferably the EA of the oldest instruction in the instruction group.



Following UISA-to-IISA instruction translation, instructions are dispatched in-order to one of latches 44, 46, 48 and 50 according to instruction type. That is, branch instructions and other condition register (CR) modifying instructions are dispatched to latch 44, fixed-point and load-store instructions are dispatched to either of latches 46 and 48, and floating-point instructions are dispatched to latch 50. Each instruction requiring a rename register for temporarily storing execution results is then assigned one or more rename registers by the appropriate one of CR mapper 52, link and count (LC) register mapper 54, exception register (XER) mapper 56, general-purpose register (GPR) mapper 58, and floating-point register (FPR) mapper 60.
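The type-based routing described above can be sketched as a simple mapping. The instruction-type strings are hypothetical labels, and alternating fixed-point and load-store instructions between latches 46 and 48 is an assumption: the text says only that they go to "either" latch, without specifying the selection policy.

```python
from itertools import cycle


def dispatch(kinds):
    """Route instruction types, in program order, to dispatch latches."""
    fx_ls_latches = cycle([46, 48])  # assumed alternation between 46 and 48
    latches = []
    for kind in kinds:
        if kind in ("branch", "cr"):               # branch and CR-modifying
            latches.append(44)
        elif kind in ("fixed", "load", "store"):   # fixed-point and load-store
            latches.append(next(fx_ls_latches))
        elif kind == "float":                      # floating-point
            latches.append(50)
        else:
            raise ValueError(f"unknown instruction type: {kind}")
    return latches


print(dispatch(["load", "fixed", "branch", "float", "store"]))  # [46, 48, 44, 50, 46]
```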

The dispatched instructions are then temporarily placed in an appropriate one of CR issue queue (CRIQ) 62, branch issue queue (BIQ) 64, fixed-point issue queues (FXIQs) 66 and 68, and floating-point issue queues (FPIQs) 70 and 72. From issue queues 62, 64, 66, 68, 70 and 72, instructions can be issued opportunistically (i.e., possibly out-of-order) to the execution units of processor 10 for execution. The instructions, however, are maintained in issue queues 62-72 until execution of the instructions is complete and the result data, if any, are written back, in case any of the instructions needs to be reissued.

As illustrated, the execution units of processor 10 include a CR unit (CRU) 90 for executing CR-modifying instructions, a branch execution unit (BEU) 92 for executing branch instructions, two fixed-point units (FXUs) 94 and 100 for executing fixed-point instructions, two load-store units (LSUs) 96 and 98 for executing load and store instructions, and two floating-point units (FPUs) 102 and 104 for executing floating-point instructions. Each of execution units 90-104 is preferably implemented as an execution pipeline having a number of pipeline stages.

During execution within one of execution units 90-104, an instruction receives operands, if any, from one or more architected and/or rename registers within a register file coupled to the execution unit. When executing CR-modifying or CR-dependent instructions, CRU 90 and BEU 92 access the CR register file 80, which in a preferred embodiment contains a CR and a number of CR rename registers that each comprise a number of distinct fields formed of one or more bits. Among these fields are LT, GT, and EQ fields that respectively indicate if a value (typically the result or operand of an instruction) is less than zero, greater than zero, or equal to zero. Link and count register (LCR) register file 82 contains a count register (CTR), a link register (LR) and rename registers of each, by which BEU 92 may also resolve conditional branches to obtain a path address. General-purpose register files (GPRs) 84 and 86, which are synchronized, duplicate register files, store fixed-point and integer values accessed and produced by FXUs 94 and 100 and LSUs 96 and 98. Floating-point register file (FPR) 88, which like GPRs 84 and 86 may also be implemented as duplicate sets of synchronized registers, contains floating-point values that result from the execution of floating-point instructions by FPUs 102 and 104 and floating-point load instructions by LSUs 96 and 98.

AT9-99-451



After an execution unit finishes execution of an instruction, the execution unit notifies GCT 38, which schedules completion of instructions in program order. To complete an instruction executed by one of CRU 90, FXUs 94 and 100 or FPUs 102 and 104, GCT 38 signals the execution unit, which writes back the result data, if any, from the assigned rename register(s) to one or more architected registers within the appropriate register file. The instruction is then removed from the issue queue, and once all instructions within its instruction group have completed, is removed from GCT 38. Other types of instructions, however, are completed differently.
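In-order completion by GCT 38 can be approximated at group granularity. This is a simplified sketch, not the actual GCT logic: each group is a list of hypothetical instruction IDs, and a group is assumed to leave the GCT only when every member has finished and every older group has already left. Write-back and issue-queue removal are not modeled.

```python
from collections import deque


def complete_groups(gct, finished):
    """Remove groups from the head of the GCT in program order.

    gct      -- deque of instruction groups, oldest first; each group is a
                list of (hypothetical) instruction IDs
    finished -- set of IDs whose execution units have reported finishing
    Returns the groups that completed during this call.
    """
    completed = []
    while gct and all(instr in finished for instr in gct[0]):
        completed.append(gct.popleft())
    return completed


gct = deque([["i0", "i1"], ["i2", "i3"], ["i4"]])
print(complete_groups(gct, {"i0", "i1", "i3", "i4"}))  # [['i0', 'i1']]
print(list(gct))  # [['i2', 'i3'], ['i4']]: i4 waits behind the older group
```

The second group cannot complete until `i2` finishes, so the finished group containing `i4` must also wait.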

When BEU 92 resolves a conditional branch instruction and determines the path address of the execution path that should be taken, the path address is compared against the speculative path address predicted by BPU 36. If the path addresses match, no further processing is required. If, however, the calculated path address does not match the predicted path address, BEU 92 supplies the correct path address to IFAR 30. In either event, the branch instruction can then be removed from BIQ 64, and when all other instructions within the same instruction group have completed, from GCT 38.

Following execution of a load instruction, the effective address computed by executing the load instruction is translated to a real address by a data ERAT (not illustrated) and then provided to L1 D-cache 20 as a request address. At this point, the load operation is removed from FXIQ 66 or 68 and placed in load data queue (LDQ) 114 until the indicated load is performed. If the request address misses in L1 D-cache 20, the request address is placed in load miss queue (LMQ) 116, from which the requested data is retrieved from L2 cache 16, and failing that, from another processor 10 or from system memory 12. As described in detail below, LDQ 114 ensures that data hazards are detected and appropriate remedial action is taken such that the later of two load instructions targeting the same address does not receive older data than the earlier of the two load instructions.

Store instructions are similarly completed utilizing a store queue (STQ) 110 into which effective addresses for stores are loaded following execution of the store instructions. From STQ 110, data can be stored into either or both of L1 D-cache 20 and L2 cache 16.

Referring now to Figure 2, there is depicted an exemplary embodiment of LDQ 114 of processor 10. As illustrated, LDQ 114 includes a number of entries, each including an effective address (EA) field 120 for storing the effective address (or address tag portion thereof) of a load instruction, a target address field 122 for storing the target address (or address tag portion thereof) from which the load instruction obtains data, a data field 124 for storing data loaded from memory by a load instruction, and a hazard field 126 for indicating that a hazard may exist for a load instruction. Entries within LDQ 114 are preferably allocated, updated, and deallocated by associated queue management logic 128 in accordance with the processes depicted in Figures 3A and 3B.
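One entry of LDQ 114 can be pictured as a record with the four fields of Figure 2. The field widths are not specified in the text, so this sketch simply assumes integers, with `None` marking a field that has not yet been written; the class name `LDQEntry` is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LDQEntry:
    ea: int                               # EA field 120 (or an EA tag)
    target_address: Optional[int] = None  # target address field 122
    data: Optional[int] = None            # data field 124 (load data)
    hazard: bool = False                  # hazard field 126


entry = LDQEntry(ea=0x2000)                     # at dispatch only the EA is known
entry.target_address, entry.data = 0x9000, 42   # filled in after execution
print(entry)  # LDQEntry(ea=8192, target_address=36864, data=42, hazard=False)
```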

With reference now to Figure 3A, there is illustrated a high level logical flowchart of an exemplary method by which queue management logic 128 of Figure 2 manages LDQ 114 in response to various stages in the local processing of load operations. As shown, the process begins at block 130 and then proceeds to block 132 in response to queue management logic 128 receiving a notification that a load instruction has been processed at some stage of the execution pipeline between dispatch and completion. In response to this notification, queue management logic 128 determines at block 132 whether the load instruction has been dispatched, executed or completed by processor 10. In response to a determination that the load instruction has been dispatched from ITU 42 to one of latches 46 and 48, as described above, the process proceeds to block 134. Block 134 depicts queue management logic 128 allocating an entry in LDQ 114 for the newly dispatched instruction in accordance with the program order of the load instruction and placing the EA of the instruction within EA field 120. Thus, the location of an entry of a load instruction within LDQ 114 preferably indicates the program ordering of the load instruction with respect to other load instructions. Thereafter, the process returns to block 132.






Returning to block 132, in response to a determination that a load instruction has been completed (together with other instructions in its instruction group) by GCT 38, queue management logic 128 deallocates the entry corresponding to the completed load instruction, for example, by identifying an entry having a matching EA. Thereafter, the process returns to block 132.
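The dispatch and completion branches of Figure 3A can be sketched as a list whose order mirrors program order. This assumes loads are dispatched in program order (so appending preserves that order) and that EAs are unique among in-flight loads; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LDQEntry:
    ea: int                               # EA field 120
    target_address: Optional[int] = None  # target address field 122
    data: Optional[int] = None            # data field 124
    hazard: bool = False                  # hazard field 126


class LoadDataQueue:
    def __init__(self) -> None:
        self.entries: List[LDQEntry] = []  # list position = program order

    def on_dispatch(self, ea: int) -> None:
        """Block 134: allocate an entry in program order, recording the EA."""
        self.entries.append(LDQEntry(ea=ea))

    def on_complete(self, ea: int) -> None:
        """On completion, deallocate the entry with a matching EA field."""
        for i, entry in enumerate(self.entries):
            if entry.ea == ea:
                del self.entries[i]
                return


ldq = LoadDataQueue()
for ea in (0x100, 0x104, 0x108):  # three loads dispatched in program order
    ldq.on_dispatch(ea)
ldq.on_complete(0x100)
print([hex(e.ea) for e in ldq.entries])  # ['0x104', '0x108']
```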



If, on the other hand, queue management logic 128 determines from the received notification at block 132 that a load instruction has been executed by one of LSUs 96 and 98, the process proceeds to block 140, which illustrates queue management logic 128 determining whether a later entry in LDQ 114 than the entry allocated to the executed load instruction has a target address in its target address field 122 that matches the target address of the executed load instruction.

If not, queue management logic 128 places the target address of the executed load instruction in the target address field 122 of the associated entry and places the data retrieved from memory (i.e., local cache, remote cache, or system memory 12) in response to execution of the load instruction in data field 124 of the associated entry, as shown at block 142. The entry associated with the executed load instruction is likewise updated as depicted at block 142 if an entry associated with a later load instruction has a matching address but a determination is made at block 144 that hazard field 126 of the matching entry is not set. However, if hazard field 126 of the matching entry is set, a data hazard is detected.
As illustrated at block 146, to correct for the data hazard, queue management logic 128 places the target address for the executed load instruction in target address field 122 of the associated entry and utilizes the data contained in data field 124 of the matching entry of the later-in-program-order load to provide the data requested by the executed load instruction. That is, the data from data field 124 of the matching entry is provided to one of GPRs 84 and 86 as specified by the executed load instruction and is also placed into data field 124 of the entry in LDQ 114 associated with the executed load instruction. Thus, the operation of queue management logic 128 minimizes the performance penalty associated with data hazards since the earlier-in-program-order load instruction need not be re-executed to obtain the correct data (i.e., in this case, the same data as the later-in-program-order load) and no flush of instructions is required. Following block 146, the process returns to block 132.
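Blocks 140 through 146 can be sketched as a single function. This is a behavioral model under stated assumptions: the queue is a list ordered oldest-first, so a higher index is later in program order; `memory_data` stands for whatever the cache hierarchy returned and is simply discarded when a hazard is detected; and, because the text does not say how multiple matching later entries are resolved, the sketch uses the oldest later entry whose target address matches and whose hazard field is set. The function name `on_execute` is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LDQEntry:
    ea: int
    target_address: Optional[int] = None
    data: Optional[int] = None
    hazard: bool = False


def on_execute(entries: List[LDQEntry], index: int, target_address: int,
               memory_data: int) -> int:
    """Update entries[index] for an executed load and return the value
    delivered to the destination GPR."""
    entry = entries[index]
    entry.target_address = target_address
    for later in entries[index + 1:]:                       # block 140
        if later.target_address == target_address and later.hazard:  # block 144
            entry.data = later.data                         # block 146: reuse
            return entry.data                               # no flush needed
    entry.data = memory_data                                # block 142
    return entry.data


ldq = [LDQEntry(ea=0x100),
       LDQEntry(ea=0x104, target_address=0x9000, data=7, hazard=True)]
print(on_execute(ldq, 0, 0x9000, memory_data=8))  # 7: data forwarded from LDQ
```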

Referring now to Figure 3B, there is depicted a high level logical flowchart of an exemplary method by which queue management logic 128 of Figure 2 updates hazard fields 126 of LDQ 114 in response to detection of remote exclusive operations and corrects data hazards occasioned by the execution of local store operations. As depicted, the process begins at block 150 and then iterates at block 152 until notification is received of a locally executed store instruction or of an exclusive access (e.g., read-with-intent-to-modify, flush or kill) request on interconnect fabric 14 made by a remote processor 10. In response to receipt of notification of local execution of a store instruction, the process passes to block 160, which is described below. However, in response to notification of an exclusive access request by a remote processor 10, the process proceeds from block 152 to block 154, which illustrates queue management logic 128 determining whether or not a target address specified by the remote exclusive access request matches the target address contained in target address field 122 of any entry within LDQ 114. If not, the process simply returns to block 152, which has been described.

However, in response to a determination that the target address of the remote exclusive access request matches the address contained in the target address field 122 of an entry in LDQ 114, queue management logic 128 sets hazard field 126 of the matching entry, as shown at block 156, to indicate the existence of a possible remotely-triggered data hazard for any earlier-in-program-order load instruction executed after the load instruction associated with the matching entry. The existence of an actual data hazard is detected at blocks 140 and 144 of Figure 3A for remote exclusive operations and at blocks 160-164 of Figure 3B for local store operations. Following block 156, the process illustrated in Figure 3B returns to block 152.
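Blocks 154 and 156 amount to an address compare against every target address field. In this sketch an entry whose load has not yet executed holds `None` in its target address field and therefore never matches; the function name `on_remote_exclusive` is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LDQEntry:
    ea: int
    target_address: Optional[int] = None  # None until the load has executed
    data: Optional[int] = None
    hazard: bool = False


def on_remote_exclusive(entries: List[LDQEntry], target_address: int) -> int:
    """Flag every executed load whose target address field matches a snooped
    remote exclusive request (e.g. RWITM, flush, kill). Returns the count."""
    flagged = 0
    for entry in entries:
        if entry.target_address == target_address:  # block 154
            entry.hazard = True                      # block 156
            flagged += 1
    return flagged


ldq = [LDQEntry(ea=0x100), LDQEntry(ea=0x104, target_address=0x9000, data=7)]
print(on_remote_exclusive(ldq, 0x9000), [e.hazard for e in ldq])  # 1 [False, True]
```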






Referring again to block 152, in response to queue management logic 128 receiving notification of execution of a local store instruction, queue management logic 128 determines at blocks 160 and 162 whether or not the target address of the store instruction matches a target address of a later-in-program-order but earlier-executed load instruction in one of target address fields 122 of LDQ 114. If not, the process simply returns to block 152, which has been described. However, in response to a determination that the target address of the store instruction matches a target address of a later-in-program-order but earlier-executed load instruction, queue management logic 128 determines that a data hazard has occurred and corrects the data hazard by flushing at least the matching load instruction and any subsequent dependent instructions and by causing these instructions to be re-executed. Queue management logic 128 also deallocates the entry in LDQ 114 allocated to the flushed load.
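The local-store check of blocks 160 and 162 can be sketched as follows. It assumes, as a modeling convenience not taken from the text, that each entry carries a program-order sequence number `seq` comparable with the store's; flushing of dependent instructions and re-dispatch are outside the model. The function name `on_local_store` is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LDQEntry:
    seq: int                              # assumed program-order sequence number
    target_address: Optional[int] = None  # None until the load has executed
    data: Optional[int] = None


def on_local_store(entries: List[LDQEntry], store_seq: int,
                   store_address: int) -> List[int]:
    """Find loads later in program order than the store that have already
    executed to the same address. Their entries are deallocated and the
    loads (plus, outside this model, their dependents) are flushed and
    re-executed. Returns the sequence numbers of the flushed loads."""
    kept, flushed = [], []
    for entry in entries:
        if entry.seq > store_seq and entry.target_address == store_address:
            flushed.append(entry)
        else:
            kept.append(entry)
    entries[:] = kept
    return [entry.seq for entry in flushed]


ldq = [LDQEntry(seq=1), LDQEntry(seq=3, target_address=0x9000, data=5)]
print(on_local_store(ldq, store_seq=2, store_address=0x9000))  # [3]
print([e.seq for e in ldq])  # [1]
```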

With reference now to Figures 4A-4C, there are illustrated three block diagrams that together illustrate an exemplary operating scenario in which a data hazard caused by a remote exclusive operation is detected and corrected in accordance with the present invention. Referring first to Figure 4A, when the operating scenario begins, two load instructions, which are designated LD1 and LD2 in program order (with LD2 being the latest in program order), have been dispatched and accordingly have been allocated entries in LDQ 114 by queue management logic 128. In addition, LD2 has been executed out-of-order with respect to LD1, and the target address (TA) and data (D) have been loaded into the appropriate entry of LDQ 114 by queue management logic 128. The hazard field 126 of the entry associated with each of the load instructions is reset to 0.

Next, as shown in Figure 4B, in response to queue management logic 128 receiving notification of a remote request for exclusive access having a target address that matches the TA of LD2, hazard field 126 of the entry associated with LD2 is set to 1. Then, as indicated in Figure 4C, when LD1 is executed out-of-order and the execution generates a target address matching the TA specified in target address field 122 of the entry associated with LD2, a data hazard is detected. Accordingly, queue management logic 128 provides the data from data field 124 of the entry corresponding to LD2 to the register file to satisfy LD1, and also records the data in data field 124 and the target address in target address field 122 of the entry corresponding to LD1. Thus, a data hazard caused by a remote exclusive operation intervening between out-of-order executed loads is detected and corrected without flushing or re-executing any instructions and without any additional latency.
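The Figure 4 scenario can be replayed end to end in a short script. The target address and data values (10 for the data LD2 buffered, 11 for the value a remote processor has since stored) are hypothetical, and the register file is reduced to a dictionary. LD1 receives the same data LD2 already consumed, so neither load observes newer data than a later-in-program-order load.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LDQEntry:
    name: str
    target_address: Optional[int] = None
    data: Optional[int] = None
    hazard: bool = False


TA = 0x9000  # hypothetical target address shared by LD1 and LD2
ldq: List[LDQEntry] = [LDQEntry("LD1"), LDQEntry("LD2")]  # program order
gpr = {}

# Figure 4A: LD2 executes first and buffers its TA and data (value 10).
ldq[1].target_address, ldq[1].data = TA, 10

# Figure 4B: a remote exclusive request to TA sets LD2's hazard field.
for entry in ldq:
    if entry.target_address == TA:
        entry.hazard = True

# Figure 4C: LD1 executes; assume the remote processor has since stored 11.
memory_value = 11
ld1, ld2 = ldq
ld1.target_address = TA
if ld2.target_address == TA and ld2.hazard:
    ld1.data = ld2.data          # forward buffered data; no flush
else:
    ld1.data = memory_value
gpr["LD1"] = ld1.data
print(gpr["LD1"], [(e.name, e.data, e.hazard) for e in ldq])
# 10 [('LD1', 10, False), ('LD2', 10, True)]
```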

As has been described, the present invention provides an improved processor and method that reduce the performance penalty associated with data hazards by recording the data associated with out-of-order load instructions in a load data queue and then, in response to detection of a data hazard, utilizing that data to satisfy an earlier-in-program-order load instruction.


While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
CLAIMS

What is claimed is: 



1. A method of processing an instruction in a processor, said method comprising:

executing a first instruction and storing data associated with executing said first instruction;

if said first instruction was executed out-of-order, determining whether the stored data should have been used in execution of a preceding second instruction executed after said first instruction; and

if the stored data should have been used to execute the second instruction, using the stored data to perform an operation indicated by said second instruction without re-executing said second instruction.
2. A processor, comprising:

means for executing a first instruction and 
storing data associated with executing said first 
instruction; 

means for determining, if said first 
instruction was executed out-of-order, whether the stored 
data should have been used in execution of a preceding 
second instruction executed after said first instruction; 
and 

means for using the stored data to perform an 
operation indicated by said second instruction without 
re-executing said second instruction, if the stored data 
should have been used to execute the second instruction. 
3. A processor, comprising:

a register set;

at least one execution unit that executes load instructions to transfer data into said register set;

a load queue containing at least one entry, wherein said entry stores load data retrieved by a first load instruction; and

queue management logic that, responsive to execution of a second load instruction, detects by reference to said load queue whether a data hazard exists, and if so, outputs said load data retrieved by said first load instruction from said entry to said register set in accordance with said second load instruction.
4. The processor of Claim 3, wherein said entry stores a target address of said first load instruction and has a hazard flag indicative of a possible data hazard, wherein said queue management logic detects that a data hazard exists if said second load instruction precedes said first instruction in program order and a target address of said second load instruction matches said target address stored in said entry and said hazard flag is set.

5. The processor of Claim 3, wherein said queue management logic sets said hazard flag at least in response to local store operation specifying said target address.

6. The processor of Claim 3, said register set comprising a general purpose register set.

7. The processor of Claim 3, wherein said queue management logic outputs said load data to a register in said register set that is specified by said second load instruction.

8. The processor of Claim 3, wherein said queue management logic, responsive to detection of a data hazard, initiates reexecution of at least said first load instruction.

9. The processor of Claim 3, wherein said queue
management logic allocates a respective entry within said
load queue to each load instruction upon dispatch and,
upon completion of said each load instruction,
deallocates said respective entry.
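
The entry lifetime described in Claim 9 can be sketched as follows. The sketch is hypothetical. The class name LoadQueue, the fixed capacity, the integer instruction tags, and the convention that a full queue makes dispatch stall are all assumptions, not details taken from the claims.

```python
from typing import Dict, Optional


class LoadQueue:
    """Hypothetical Claim 9 model: one entry per load instruction, allocated at
    dispatch and freed at completion. Capacity and tag values are assumptions."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Maps an instruction tag to its load data (None until the load executes).
        # Python dicts preserve insertion order, which here mirrors dispatch order.
        self.entries: Dict[int, Optional[int]] = {}

    def allocate(self, tag: int) -> bool:
        # On dispatch; returning False signals that dispatch would have to stall.
        if len(self.entries) >= self.capacity:
            return False
        self.entries[tag] = None
        return True

    def fill(self, tag: int, data: int) -> None:
        # On execution, record the data the load retrieved.
        self.entries[tag] = data

    def deallocate(self, tag: int) -> None:
        # On completion, release the entry for reuse.
        del self.entries[tag]


if __name__ == "__main__":
    q = LoadQueue(capacity=2)
    print(q.allocate(1), q.allocate(2), q.allocate(3))  # True True False
    q.fill(1, 42)
    q.deallocate(1)
    print(q.allocate(3), list(q.entries))  # True [2, 3]
```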

10. A data processing system, comprising:
an interconnect fabric;
a memory coupled to said interconnect fabric;
a register set;
at least one execution unit that executes load
instructions to transfer data from said memory into said
register set;
a load queue containing at least one entry,
wherein said entry stores load data retrieved by a first
load instruction; and
queue management logic that, responsive to
execution of a second load instruction, detects by
reference to said load queue whether a data hazard
exists, and if so, outputs said load data retrieved by
said first load instruction from said entry to said
register set in accordance with said second load
instruction.



AT9-99-451 






11. The data processing system of Claim 10, wherein said 
entry stores a target address of said first load 
instruction and has a hazard flag indicative of a 
possible data hazard, wherein said queue management logic 
detects that a data hazard exists if said second load 
instruction precedes said first load instruction in program
order and a target address of said second load 
instruction matches said target address stored in said 
entry and said hazard flag is set. 

12. The data processing system of Claim 11, wherein said 
queue management logic sets said hazard flag at least in 
response to a local store operation specifying said target
address.

13. The data processing system of Claim 12, wherein said 
at least one execution unit, said register set and said 
load queue comprise a first processor and said data 
processing system includes a second processor, wherein 
said queue management logic also sets said hazard flag in 
response to said second processor issuing an exclusive 
access operation specifying said target address on said 
interconnect fabric.
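
Claims 12 and 13 identify two events that set the hazard flag: a local store, and an exclusive-access operation issued on the interconnect fabric by a second processor. The hypothetical sketch below routes both events through one matching routine. The function names are assumptions. The sketch also assumes exact address equality; a real design might instead compare at a coarser granularity, such as a cache line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueEntry:
    target_address: int
    load_data: int
    hazard: bool = False


def mark_hazards(queue: List[QueueEntry], address: int) -> int:
    """Set the hazard flag on every entry whose target address matches.
    Exact address equality is assumed; returns the number of entries flagged."""
    flagged = 0
    for entry in queue:
        if entry.target_address == address:
            entry.hazard = True
            flagged += 1
    return flagged


def on_local_store(queue: List[QueueEntry], address: int) -> int:
    # Claim 12: a store executed by this processor to the target address.
    return mark_hazards(queue, address)


def on_snooped_exclusive_access(queue: List[QueueEntry], address: int) -> int:
    # Claim 13: another processor's exclusive-access operation observed
    # on the interconnect fabric for the target address.
    return mark_hazards(queue, address)


if __name__ == "__main__":
    q = [QueueEntry(0x100, 1), QueueEntry(0x200, 2)]
    on_local_store(q, 0x100)
    on_snooped_exclusive_access(q, 0x200)
    print([e.hazard for e in q])  # [True, True]
```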

14. The data processing system of Claim 10, said 
register set comprising a general purpose register set. 

15. The data processing system of Claim 10, wherein said
queue management logic outputs said load data to a
register in said register set that is specified by said
second load instruction.



16. The data processing system of Claim 10, wherein said
queue management logic, responsive to detection of a data
hazard, initiates reexecution of at least said first load
instruction.



17. The data processing system of Claim 10, wherein said
queue management logic allocates a respective entry
within said load queue to each load instruction upon
dispatch and, upon completion of said each load
instruction, deallocates said respective entry.

18. A method of executing load instructions out-of-order
in a processor having a register set and a load queue,
said method comprising:
storing, in an entry in said load queue, load
data retrieved from memory in response to executing a
first load instruction;
in response to execution of a second load
instruction, detecting by reference to said load queue
whether a data hazard exists; and
in response to detection of a data hazard,
outputting said load data retrieved by said first load
instruction from said entry to said register set in
accordance with said second load instruction.
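
The method of Claim 18 can be traced end to end on the LOAD1/STORE/LOAD2 sequence from the background, with LOAD2 executing before LOAD1. This is a hypothetical sketch. The intervening write is modeled as a store by another processor observed on the interconnect, so LOAD2 legitimately read the old value. The following are all assumptions:
- the class Core and its method remote_store;
- the integer order tags;
- taking the first qualifying entry when several match;
- allocating entries at execution rather than at dispatch.

For brevity, entries are never deallocated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    order: int    # program-order tag (smaller = earlier), an assumed encoding
    address: int  # target address of the load
    data: int     # load data retrieved by the load
    hazard: bool = False


class Core:
    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory
        self.registers: Dict[str, int] = {}
        self.queue: List[Entry] = []

    def load(self, order: int, address: int, dest: str) -> None:
        # Detect a hazard by reference to the queue: a later load in program
        # order already read this address and its hazard flag is set.
        hit: Optional[Entry] = None
        for e in self.queue:
            if order < e.order and e.address == address and e.hazard:
                hit = e
                break
        # On a hazard, output the queued data instead of re-reading memory.
        data = hit.data if hit is not None else self.memory[address]
        self.registers[dest] = data
        # Store this load's data in a queue entry (simplified: allocated here,
        # at execution, rather than at dispatch).
        self.queue.append(Entry(order, address, data))

    def remote_store(self, address: int, value: int) -> None:
        # Another processor writes the address; flag matching queue entries.
        self.memory[address] = value
        for e in self.queue:
            if e.address == address:
                e.hazard = True


if __name__ == "__main__":
    core = Core({0x40: 1})
    core.load(order=4, address=0x40, dest="r2")  # LOAD2 executes first
    core.remote_store(0x40, 2)                   # address is rewritten
    core.load(order=1, address=0x40, dest="r1")  # LOAD1 executes late
    print(core.registers)  # {'r2': 1, 'r1': 1}
```

Without the queue, LOAD1 would read 2 while LOAD2 kept 1, so LOAD2 would hold older data than LOAD1. With the queue, both registers receive 1 and no flush is required.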

19. The method of Claim 18, wherein said entry stores a
target address of said first load instruction and has a
hazard flag indicative of a possible data hazard, wherein
detecting that a data hazard exists comprises determining
if said second load instruction precedes said first load
instruction in program order and a target address of said
second load instruction matches said target address
stored in said entry and said hazard flag is set.

20. The method of Claim 19, and further comprising 
setting said hazard flag at least in response to a local 
store operation specifying said target address. 

21. The method of Claim 19, wherein outputting said load 
data comprises outputting said load data to a register in 
said register set that is specified by said second load 
instruction. 



22. The method of Claim 19, and further comprising: 

in response to detection of a data hazard, 
initiating reexecution of at least said first load 
instruction. 



23. The method of Claim 19, and further comprising 
allocating a respective entry within said load queue to 
each load instruction upon dispatch and, upon completion 
of said each load instruction, deallocating said 
respective entry. 

ABSTRACT OF THE DISCLOSURE
PROCESSOR AND METHOD OF EXECUTING LOAD INSTRUCTIONS OUT- 
OF-ORDER HAVING REDUCED HAZARD PENALTY 

A processor having a reduced data hazard 
penalty includes a register set, at least one execution 
unit that executes load instructions to transfer data 
into the register set, and a load queue. The load queue 
contains at least one entry, and each occupied entry in 
the load queue stores load data retrieved by an executed 
load instruction in association with a target address of 
the executed load instruction. The load queue has 
associated queue management logic that, in response to 
execution by the execution unit of a load instruction, 
determines by reference to the load queue whether a data 
hazard exists for the load instruction. If so, the queue 
management logic outputs load data from the load queue to 
the register set in accordance with the load instruction, 
thus eliminating the need to flush and re-execute the 
load instruction. 

DECLARATION AND POWER OF ATTORNEY FOR
PATENT APPLICATION 



As a below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next 
to my name; 

I believe I am the original, first and sole inventor (if only one name is 
listed below) or an original, first and joint inventor (if plural names are 
listed below) of the subject matter which is claimed and for which a patent is 
sought on the invention entitled 

PROCESSOR AND METHOD OF EXECUTING LOAD INSTRUCTIONS OUT-OF-ORDER HAVING REDUCED 
HAZARD PENALTY 

the specification of which (check one) 
X is attached hereto. 

was filed on 

as Application Serial No. 

and was amended on 

(if applicable) 

I hereby state that I have reviewed and understand the contents of the above 
identified specification, including the claims, as amended by any amendment 
referred to above. 

I acknowledge the duty to disclose information which is material to the 
patentability of this application in accordance with Title 37, Code of Federal
Regulations, §1.56.

I hereby claim foreign priority benefits under Title 35, United States Code, §119 
of any foreign application(s) for patent or inventor's certificate listed below
and have also identified below any foreign application for patent or inventor's 
certificate having a filing date before that of the application on which 
priority is claimed: 

Prior Foreign Application(s): Priority Claimed

Yes No 

(Number) (Country) (Day/Month/Year)



I hereby claim the benefit under Title 35, United States Code, §120 of any United 
States application(s) listed below and, insofar as the subject matter of each of
the claims of this application is not disclosed in the prior United States 
application in the manner provided by the first paragraph of Title 35, United 
States Code, §112, I acknowledge the duty to disclose information material to 
the patentability of this application as defined in Title 37, Code of Federal 
Regulations, §1.56 which occurred between the filing date of the prior
application and the national or PCT international filing date of this 
application: 



(Application Serial #) (Filing Date) (Status) 



I hereby declare that all statements made herein of my own knowledge are true and 
that all statements made on information and belief are believed to be true; and 
further that these statements were made with the knowledge that willful false 
statements and the like so made are punishable by fine or imprisonment, or both, 
under Section 1001 of Title 18 of the United States Code and that such willful 
false statements may jeopardize the validity of the application or any patent 
issued thereon. 

POWER OF ATTORNEY: As a named inventor, I hereby appoint the following attorneys 
and/or agents to prosecute this application and transact all business in the 
Patent and Trademark Office connected therewith. 

John W. Henderson, Jr., Reg. No. 26,907; Thomas E. Tyson, Reg. No. 28,543; Robert
M. Carwell, Reg. No. 28,499; Jeffrey S. LaBaw, Reg. No. 31,633; Douglas H. 
Lefeve, Reg. No. 26,193; Casimer K. Salys, Reg. No. 28,900; David A. Mims, Jr., 
Reg. No. 32,708; Volel Emile, Reg. No. 39,969; James H. Barksdale, Jr., Reg. No.
24,091; Anthony V. England, Reg. No. 35,129; Leslie A. Van Leeuwen, Reg. No. 
42,196; Marilyn S. Dawkins, Reg. No. 31,140; Mark E. McBurney, Reg. No. 33,114; 
Christopher A. Hughes, Reg. No. 26,914; Edward A. Pennington, Reg. No. 32,588; 
John E. Hoel, Reg. No. 26,279; Joseph C. Redmond, Jr., Reg. No. 18,753; Andrew 
J. Dillon, Reg. No. 29,634; Max Ciccarelli, Reg. No. 39,454; Daniel E. Venglarik, 
Reg. No. 39,409; Jack V. Musgrove, Reg. No. 31,986; Brian F. Russell, Reg. No. 
40,796; Steven Lin, Reg. No. 35,250; Matthew W. Baca, Reg. No. 42,277; Antony P. 
Ng, Reg. No. 43,427; John G. Graham, Reg. No. 19,563; Matthew S. Anderson, Reg. 
No. 39,093; Michael R. Barre, Reg. No. 44,023; Andrew Mitchell Harris, Reg. No. 
42,638; Richard McCain, Reg. No. 43,785; Michael Noe, Reg. No. 44,975; and 
Sidney L. Weatherford, Reg. No. P-45,602. 

Send correspondence to: Andrew J. Dillon, FELSMAN, BRADLEY, VADEN, GUNTER & 
DILLON, LLP, Suite 350 Lakewood on the Park, 7600B North Capital of Texas 
Highway, Austin, Texas 78731, and direct all telephone calls to Andrew J. Dillon,
(512) 343-6116. 